# Parallel Computing with GPUs: CUDA Assignment Project Exam Help Performance https://powcoder.com

Dr Paul Richmond http://paulrichmond.sher.ac.uk/teaching/COM4521/





- ☐Global Memory Coalescing
- ☐Global Memory Coalescing with the L1 Cache
- Occupancy and Thread Block Dimensions
  Assignment Project Exam Help

https://powcoder.com

Add WeChat powcoder





# Coalesced Global Memory Access

☐ When memory is loaded/stored from global memory to L2 (and L1) it is moved in cache lines ☐ If threads within a warp access global memory in irregular patterns this can cause increased movement (transactions) of data Help

Coalesced access is where sequential threads in a warp access sequentially adjacent 4 type words (e.g. 4010) at or int values). Having coalesced access will reduce the number of cache lines moved and increase memory performance ☐ This is one of the most important performance considerations of GPU memory usage!





# Use of Memory Cache Levels



#### L2 Coalesced Memory Access



```
__global__ void copy(float *odatt, float* idata).com
int xid = blockIdx.x * blockDim.x + threadIdx.x;
odata[xid] = idata[xid]; Add WeChat powcoder
}
```

- ☐Global memory always moves through L2
  - ☐But not always through L1 depending on architecture
- ☐ In L2 cache line size is 32B
  - ☐ For a coalesced read/write within a warp, 4 transactions required
  - □100% memory bus speed





# L2 Permuted Memory Access



https://powcoder.com

Add WeChat powcoder

- ☐ Permuted Access
  - ☐ Within the cache line accesses can be permuted between threads
  - No performance penalty





#### L2 Permuted Memory Access



https://powcoder.com

Assignment Project Exam Help

Add WeChat powcoder

- ☐ Permuted Access
  - ☐ Permuted access within 128 byte segments is permitted
    - ☐Will NOT cause multiple loads
    - ☐ Must not be permuted over the 128 byte boundary





# L2 Offset Memory Access



Assignment Project Exam Help

```
__global___ void copy(float *odatt, float* idata).com
int xid = blockIdx.x * blockDim.x + threadIdx.x + OFFSET;
odata[xid] = idata[xid]; Add WeChat powcoder
}
```

- ☐ If memory accesses are offset then parts of the cache line will be unused (shown in red) e.g.
  - ☐ 5 transactions of 160B of which 128B is required: 80% utilisation
- ☐ Use thread blocks sizes of multiples of 32!





# L2 Strided Memory Access



#### Assignment Project Exam Help

```
__global___ void copy(float *odatp, float* foata).com
int xid = (blockIdx.x * blockDim.x + threadIdx.x)* STRIDE;
odata[xid] = idata[xid]; Add WeChat powcoder
}
```

 $\square$  How many cache lines transactions for warp if STRIDE = 2?





# L2 Strided Memory Access



#### Assignment Project Exam Help

```
__global__ void copy(float *odatp, floata) rodata.x * blockDim.x + threadIdx.x) * STRIDE;
odata[xid] = idata[xid]; Add WeChat powcoder
}
```

- ☐Strided memory access can result in bad performance e.g.
  - ☐ A stride of 2 causes 8 transactions: 50% useful memory bandwidth
  - ■As stride of >32 causes 32 transactions: ONLY 3.125% bus utilisation!
    - ☐ This is as bad as random access
    - ☐ Transpose data if it is stride-N





#### Degradation in Strided Access Performance



□Note: Performance worsens beyond a stride of just 8 as adjacent or concurrent warps (on same SM) can't re-use cache lines from L2







#### Array of Structures vs Structures of Arrays

- ☐ Array of Structures (AoS)
  - ☐ Common method to store groups of data (e.g. points)

```
struct point {
    float x, y, z;
};
    Assignment Project Exam Help
__device__ struct point d_points[N];

__global__ void manipulate_https://powcoder.com

{
    float x = d_points[blockIdx_x*blockDim.x + threadIdx_x].x;
    float y = d_points[blockIdx.x*blockDim.x + threadIdx_x].x;
    float z = d_points[blockIdx.x*blockDim.x + threadIdx.x].z;

    func(x, y, z);
}
```

Is this a good kernel?





#### Array of Structures vs Structures of Arrays

- ☐ Array of Structures (AoS)
  - □Common method to store groups of data (e.g. points)

```
struct point {
    float x, y, z;
};
    Assignment Project Exam Help
    __device__ struct point d_points[N];

__global__ void manipulate https://powcoder.com
{
    float x = d_points[blockIdx.x*blockDim.x + threadIdx.x].x;
    float y = d_points[blockIdx.x*blockDim.x + threadIdx.x].x;
    float z = d_points[blockIdx.x*blockDim.x + threadIdx.x].x;
    func(x, y, z);
}
```





#### Array of Structures vs Structures of Arrays

☐ An Alternative: Structure of Arrays (SoA)

```
struct points {
    float x[N], y[N], z[N];
};

__device__ struct points Assignment Project Exam Help

__global__ void manipulate_points()
{
    float x = d_points.x[blockIdx.x*blockDim.x + threadIdx.x];
    float y = d_points.y[blockIdx.x*blockDim.x + threadIdx.x];
    float z = d_points.z[blockIdx.x*blockDim.x + threadIdx.x];
}
```

100% effective memory bandwidth







- ☐Global Memory Coalescing
- ☐Global Memory Coalescing with the L1 Cache
- Occupancy and Thread Block Dimensions
  Assignment Project Exam Help

https://powcoder.com

Add WeChat powcoder



| ☐ What affect does this have on performant                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | ce of memory movement?     |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------|
| ☐ Can be good in certain circumstances                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |                            |
| Coalesced access with adjacent warps reading san                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               | ne data                    |
| Can also be bad                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                | van TTalia                 |
| Un-coalesced acces plestomente Project Exa                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | ım Heip                    |
| Increases over-fetch                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |                            |
| Does my card support https://prescherge                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | )111<br>                   |
| Check globalL1CacheSupported and localL1CacheSupported and localL1Cach | ache Supported CUDA device |
| ☐ Maxwell 5.2 reports globalL1CacheSupported false                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             | e when in fact true!       |
| ☐ Enabling L1 caching of global loads                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |                            |
| ☐Pass the -Xptxas -dlcm=ca flag to nvcc                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | at compile time            |
| □-dlcm=cg can be used to disable L1 on devices whi                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             | ich use it by default      |





# Enabling L1 Cache in Visual Studio







# L1 Coalesced Memory Access



```
__global__ void copy(float *oaad WeChatapowcoder
  int xid = blockIdx.x * blockDim.x + threadIdx.x;
  odata[xid] = idata[xid];
}
```

- □All addresses fall in one 128B cache line
  - ☐ Single transaction
- □100% bus utilisation





# L1 Permuted Memory Access



- ☐Any thread within the warp can permute access
- ☐Same as L2





# L1 Offset Memory Access



Assignment Project Exam Help

https://powcoder.com

```
__global___void copy(float *odata float* idata)
int xid = blockIdx.x * blockIdm.x * that applies.codeffset;
odata[xid] = idata[xid];
}
```

- ☐ If memory accesses are offset then parts of the cache line will be unused (shown in red) e.g.
  - ☐ 2 transactions of 256B of which 128B is required: 50% utilisation
- ☐ For strided and random access performance is much worse with L1





- ☐Global Memory Coalescing
- ☐Global Memory Coalescing with the L1 Cache
- Occupancy and Thread Block Dimensions Assignment Project Exam Help

https://powcoder.com

Add WeChat powcoder



# Occupancy

- □Occupancy is the ratio of active warps on an SMP to the maximum number of active warps supported by the SMP
  - □Occupancy = Active Warps / Maximum Active Warps

#### Assignment Project Exam Help

- ☐Why does it vary?
  - https://powcoder.com

    Resources are allocated at thread block level and resources are finite
  - Multiple thread blocks and beas signed power Streaming Multi Processor
  - ☐Your occupancy might be limited by either
    - 1. Number of registers
    - Shared memory usage
    - Limitations on physical block size





# Why is occupancy important

□ Implications of Increasing Occupancy
□ Memory bound code
□ Higher occupancy will hide memory latency
□ If bandwidth is less than peak then increasing active warps might improve this Assignment Project Exam Help
□ Compute bound code
□ Will not improve performance/powcoder.com
□ 100% occupancy not required for maximum performance
□ Instruction throughput Aight Worthalt powcoder
□ Memory bandwidth might be fully saturated



#### Occupancy and Thread Block Size

☐ Thread Block Limitations □Always a factor of 32 (warp size) ☐ Changing the thread block size will change occupancy If thread block size is stigns ment Project Exam Help ☐ There is a fixed limit on the number of active thread blocks per SM (16 in Kepler, 32 in https://powcoder.com Maxwell) □ If thread block is too big
□ Not enough resources for another block
□ Not enough resources for another block ☐ Block is stalled until enough resource is available ☐ The relationship between thread block size and occupancy is non linear □ Complex interplay between resources





# Occupancy Calculator

- ☐ The CUDA Occupancy calculator is available for download
  - http://developer.download.nvidia.com/compute/cuda/CUDA Occupancy calculator.xls
  - By giving a Compute Capability. Threads per block usage registers per thread and shared memory per block occupancy can be predicted.
  - It will also inform you what factoris limiting of upancy (registers, SM, block size)











#### Occupancy Calculator

- ☐ How do I know what my SM usage per block is?
  - it or dynamically requested it as a kernel argument
- ☐ How do I know what my register usage is?
  - ☐CUDA build rule Device Properties-> Verbose PTX Output = Yes
    - ☐i.e. nvcc –ptxas-options=-v

```
1>----- Build started: Project: Lab06, Configuration: Debug Win32 -----
         1> Compiling CUDA source file kernel.cu...
         1> E:\Lab06>"nvcc.exe" -gencode=arch=compute_35,code=\"sm_35,compute_35\" --use-local-env --cl-
         version 2013 -ccbin "C:\MSVS12.0\VC\bin" -I"C:\CUDA\v7.0\include" -I"C:\CUDA\v7.0\include" -G
         -keep-dir Debug -maxrregcount=0 --ptxas-options=-v --machine 32 --compile -cudart static -Xptxas -
         dlcm=ca -g -D_CUDACC_ -DWIN32 -D_DEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od
         /Zi /RTC1 /MDd " -o Debug\kernel.cu.obj "E:\Lab06\kernel.cu"
         1> ptxas info
                           : 0 bytes gmem
         1> ptxas info
                          : Function properties for cudaGetDevice
              Partes stack frame, 0 bytes spill stores, 0 bytes spill loads bytes into Function properties for adaFuncGetAttributes
                 8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
                        : Function properties for cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags
        13 // 24 bytes stalk frame, 0 bytes spill stores, 0 bytes spill loads
https://ppoiw.co.derioco.phlties for cudaDeviceGetAttribute
                16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
                           : Compiling entry function ' Z9addKernelPiS S ' for 'sm 35'
         1x xptxas info _ __: Function properties for _Z9addKernelPiS_S_
              Clout data below, 6 but Copill stores, 0 bytes spill loads
         1> ptxas info : Used 6 registers, 332 bytes cmem[0]
         1> ptxas info
                        : Function properties for cudaMalloc
                8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
                           : Function properties for cudaOccupancyMaxActiveBlocksPerMultiprocessor
            ptxas info
                16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
            kernel.cu
            Lab06.vcxproj -> E:\Lab06.exe
            copy "C:\CUDA\v7.0\bin\cudart*.dll" "E:\Lab06\Debug\"
            C:\CUDA\v7.0\bin\cudart32 70.dll
         1> C:\CUDA\v7.0\bin\cudart64 70.dll
                    2 file(s) copied.
         ======= Build: 1 succeeded, 0 failed, 0 up-to-date, 0 skipped ========
```





# Intelligent Launching

□Since CUDA 6.5 It is possible to launch block sizes to maximise occupancy (using the Occupancy API)
□This does not guarantee good performance!
□However: Usually A good compart of the Exam Help
□cudaOccupancyMaxPotentialBlockSize: will find best block size and minimum grid size
□Actual grid size must be calculated hat powcoder.

```
int blockSize;
int minGridSize;
int gridSize;

Static SM size

cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, MyKernel, 0, 0);
gridSize = (arrayCount + blockSize - 1) / blockSize; //round up

MyKernel <<< gridSize, blockSize >>> (d_data, arrayCount);
```





# Occupancy SDK for Shared Memory

☐ What if SM use varies depending on block size?

```
int SMFunc(int blockSize) {
 return blockSize*sizeof(int);
                            Assignment Project Exam Help
                                 https://powcoder.com
void launchMyKernel(int *d data, int arrayCount)
                                 Add WeChat powcoder
 int blockSize;
 int minGridSize;
 int gridSize;
 cudaOccupancyMaxPotentialBlockSizeVariableSMem(&minGridSize, &blockSize, MyKernel, SMFunc, 0);
 gridSize = (N + blockSize - 1) / blockSize;
 MyKernel <<< gridSize, blockSize, SMFunc(gridSize) >>>(d data, arrayCount);
```





#### Other considerations for block sizes

- Waves and Tails
  - ☐ A Wave is a set of thread blocks that run concurrently on the device
    - ☐ Grid launch may have multiple waves
  - A Tail is the partial thread block left as a result of dividing problem size by thread block dimensions
- ■Performance Implications://powcoder.com
  - Larger thread blocks sizes may result in inefficient execution Add WeChat powcoder







#### Other considerations for block sizes

#### Assignment Project Exam Help

https://powcoder.com



# Summary

☐ For best memory bandwidth coalesced access is very important ☐ Care should be taken to avoid unnecessary offsets or strides which degrade memory performance □ Structures of Arrays signification of Structures L1 cache can be good fortim proving the formance but will reduce performance when strided or random access patterns are used Add WeChat powcoder

Occupancy is a measure which can be used for improving the performance of memory bound code ☐ Large thread blocks might be good for occupancy but introduce large tails (benchmarking to balance tradeoff is therefore crucial!)





# Acknowledgements and Further Reading

☐ Cache line sizes ☐GPU Performance Analysis □http://on-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-GPU-PerAccinenta Preisiect Exam Help Waves and Tails (<a href="https://powcoder.com/https://powcoder.com/demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-demand.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gputechconf.gpu GPU-Performance-Analysia W Chat powcoder ☐ How to enable use of L1 cache □http://acceleware.com/blog/opt-in-L1-caching-global-loads ☐ Better Performance at Lower Occupancy http://nvidia.fullviewmedia.com/gtc2010/0922-a5-2238.html



